Using version 3 of the MN output.
Subsequence:
1. check_parity: there is only one parity in house numbers (either odd or even)
2. check_direction: house number is only increasing/decreasing
3. check_street: there is only one street name
4. jump size: adjacent house numbers differ by no more than the specified jump size
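The four subsequence rules can be sketched as a greedy split: walk the records in order and start a new subsequence whenever a rule would be broken. This is a minimal Python illustration, not the pipeline's own code. The function name `split_subsequences`, the `(street, house_number)` record shape, and the choice to let repeated house numbers continue a subsequence are all assumptions.

```python
from typing import List, Tuple

Record = Tuple[str, int]  # (street name, house number); NA rows assumed already removed


def split_subsequences(
    records: List[Record], jump_size: int = 10, check_street: bool = True
) -> List[List[Record]]:
    """Greedily split records into subsequences that satisfy all four rules."""
    sequences: List[List[Record]] = []
    current: List[Record] = []
    direction = 0  # +1 increasing, -1 decreasing, 0 not yet known

    for street, num in records:
        if current:
            prev_street, prev_num = current[-1]
            step = num - prev_num
            step_dir = (step > 0) - (step < 0)

            same_parity = num % 2 == prev_num % 2                  # check_parity
            same_direction = direction == 0 or step_dir == 0 or step_dir == direction  # check_direction
            same_street = (not check_street) or street == prev_street  # check_street
            small_jump = abs(step) <= jump_size                    # jump size

            if same_parity and same_direction and same_street and small_jump:
                current.append((street, num))
                direction = direction or step_dir
                continue
            # A rule was broken: close the current subsequence.
            sequences.append(current)

        # Start a new subsequence with this record.
        current = [(street, num)]
        direction = 0

    if current:
        sequences.append(current)
    return sequences
```

With `check_street=False`, a change of street name alone no longer ends a subsequence, so the same records yield fewer, longer subsequences.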
Merged sequence:
1. there is only 1 street name (only when check_street is set to TRUE)
2. there are no more than 2 house numbers whose parities are different from the rest of the house numbers in the sequence
3. Adjacent house numbers do not differ by more than 10. This threshold can be changed by setting jump_size.
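As a rough check of the three merged-sequence rules, the sketch below tests whether a candidate merged sequence is acceptable. It is written in Python for illustration only. The name `is_valid_merged` and the parameter `max_parity_outliers` are hypothetical; only `check_street` and `jump_size` come from the notes.

```python
from collections import Counter
from typing import List, Tuple

Record = Tuple[str, int]  # (street name, house number)


def is_valid_merged(
    records: List[Record],
    jump_size: int = 10,
    check_street: bool = False,
    max_parity_outliers: int = 2,  # hypothetical name for the "no more than 2" rule
) -> bool:
    """Return True if the records satisfy the merged-sequence rules."""
    if not records:
        return False
    nums = [n for _, n in records]

    # Rule 1: a single street name, only enforced when check_street is set.
    if check_street and len({s for s, _ in records}) > 1:
        return False

    # Rule 2: at most max_parity_outliers house numbers of the minority parity.
    parity = Counter(n % 2 for n in nums)
    if min(parity.get(0, 0), parity.get(1, 0)) > max_parity_outliers:
        return False

    # Rule 3: adjacent house numbers differ by no more than jump_size.
    return all(abs(b - a) <= jump_size for a, b in zip(nums, nums[1:]))
```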
| | st_on | st_off |
|---|---|---|
| min | 1 | 1 |
| mean | 26.70967 | 21.02628 |
| median | 6 | 5 |
| max | 4945 | 4941 |
| count | 3744 | 4756 |
Quick recap:
- Why are there NA house numbers: the fill down on cleaned house numbers (output of 04 clean and 05 fill down) was done using ED and best match. So NAs most likely result from a dissimilar best_match above/below or, less likely, from a dissimilar ED.
- Also note that best_match itself was filled down; hence, these NA rows had a raw street address attached to them but no house number.
- How did sequence generation deal with NAs: in the sequence-generating function, sequences were generated using a df with the NAs removed. Then, to attach a SEQ to the NA rows, a fill down followed by a fill up was done (without any restriction on street).
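The fill down then up step explains why NA rows end up inside sequences. Below is a minimal Python sketch of that step on a plain list of SEQ ids, where `None` stands in for NA. The function name `fill_down_then_up` is an assumption for illustration.

```python
from typing import List, Optional


def fill_down_then_up(values: List[Optional[int]]) -> List[Optional[int]]:
    """Fill gaps with the previous value, then fill any leading gaps from below."""
    filled = list(values)

    # Fill down: each gap takes the most recent value above it.
    last = None
    for i, v in enumerate(filled):
        if v is None:
            filled[i] = last
        else:
            last = v

    # Fill up: only leading gaps remain, and they take the first value below.
    nxt = None
    for i in range(len(filled) - 1, -1, -1):
        if filled[i] is None:
            filled[i] = nxt
        else:
            nxt = filled[i]
    return filled
```

For example, `fill_down_then_up([None, 1, None, None, 2, None])` gives `[1, 1, 1, 1, 2, 2]`: every NA row inherits the SEQ of the rows above it, whatever its street, and only NA rows at the very top take the SEQ below.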
A look at all the sequences with NAs:
There are two types of sequences containing NAs.
Type 1: NAs occur at the end of the sequence, suggesting that the non-NA sequence before and the NA sequence after are distinct sequences. An example:
| street_add | best_match | result_type | house_num | hn_1 | hn_2 | hn_3 |
|---|---|---|---|---|---|---|
| E 98 ST | E 90 | 2 | 200 | 200 | NA | NA |
| E 98 ST | E 90 | 2 | 200 | 200 | NA | NA |
| E 98 ST | E 90 | 2 | 200 | 200 | NA | NA |
| E 98 ST | E 90 | 2 | 200 | 200 | NA | NA |
| E 98 ST | E 90 | 2 | 200 | 200 | NA | NA |
| E 98 ST | E 90 | 2 | 200 | 200 | NA | NA |
| E 93 | E 93 | 1 | NA | NA | NA | NA |
| E 93 | E 93 | 1 | NA | NA | NA | NA |
| E 93 | E 93 | 1 | NA | NA | NA | NA |
| E 93 | E 93 | 1 | NA | NA | NA | NA |
| E 93 | E 93 | 1 | NA | NA | NA | NA |
| E 93 | E 93 | 1 | NA | NA | NA | NA |
| E 93 | E 93 | 1 | NA | NA | NA | NA |
| E 93 | E 93 | 1 | NA | NA | NA | NA |
| E 93 | E 93 | 1 | NA | NA | NA | NA |
| E 93 | E 93 | 1 | NA | NA | NA | NA |
| E 93 | E 93 | 1 | NA | NA | NA | NA |
Type 2: NAs occur between two sequences of house numbers, i.e. there is a block of streets with no house number sandwiched between 2 sequences that would’ve been joined. An example:
| street_add | best_match | result_type | house_num | hn_1 | hn_2 | hn_3 |
|---|---|---|---|---|---|---|
| KINGSBRIDGE ROAD | HAWTHORNE | 4 | 4850 | 4850 | NA | NA |
| KINGSBRIDGE ROAD | HAWTHORNE | 4 | 4850 | 4850 | NA | NA |
| KINGSBRIDGE ROAD | HAWTHORNE | 4 | 4850 | 4850 | NA | NA |
| HANTHORNE ST | HAWTHORNE | 2 | 4850 | 4850 | NA | NA |
| COOPER ST | COOPER | 1 | NA | NA | NA | NA |
| COOPER ST | COOPER | 1 | NA | NA | NA | NA |
| COOPER ST | COOPER | 1 | NA | NA | NA | NA |
| COOPER ST | COOPER | 1 | NA | NA | NA | NA |
| HAWTHORNE ST | HAWTHORNE | 1 | 4850 | 4850 | NA | NA |
| HAWTHORNE ST | HAWTHORNE | 1 | 4850 | 4850 | NA | NA |
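The two types can be told apart from the position of the NAs within a sequence. The sketch below is a Python illustration under the assumption that a sequence is passed as a list of house numbers with `None` for NA. The function name and the `None` return value for cases the notes do not cover (no NAs, all NAs, or only leading NAs) are hypothetical.

```python
from typing import List, Optional


def classify_na_sequence(house_nums: List[Optional[int]]) -> Optional[int]:
    """Return 1 for trailing NAs only, 2 for NAs sandwiched between numbers."""
    present = [i for i, h in enumerate(house_nums) if h is not None]
    missing = [i for i, h in enumerate(house_nums) if h is None]
    if not present or not missing:
        return None  # nothing to classify

    first, last = present[0], present[-1]
    # Type 2: at least one NA sits between two non-NA house numbers.
    if any(first < i < last for i in missing):
        return 2
    # Type 1: every NA comes after the last non-NA house number.
    if all(i > last for i in missing):
        return 1
    return None  # e.g. only leading NAs, which the notes do not describe
```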
Could this be useful in further street name cleaning?
| street_add | best_match | result_type | house_num | hn_1 | hn_2 | hn_3 |
|---|---|---|---|---|---|---|
| BAYTES ST | BAXTER | 2 | 79 | 79 | NA | NA |
| BAYTES ST | BAXTER | 2 | 79 | 79 | NA | NA |
| BAYTES ST | BAXTER | 2 | 79 | 79 | NA | NA |
| BAYTES ST | BAXTER | 2 | 79 | 79 | NA | NA |
| BAYTES ST | BAXTER | 2 | 79 | 79 | NA | NA |
| BAYTES ST | BAXTER | 2 | 79 | 79 | NA | NA |
| BANTAS STREET | CANAL | 3 | NA | NA | NA | NA |
| BANTAS STREET | CANAL | 3 | NA | NA | NA | NA |
| BANTAS STREET | CANAL | 3 | NA | NA | NA | NA |
| BANTAS STREET | CANAL | 3 | NA | NA | NA | NA |
| BANTAS STREET | CANAL | 3 | NA | NA | NA | NA |
| BANTAS STREET | CANAL | 3 | NA | NA | NA | NA |
| BANTAS STREET | CANAL | 3 | NA | NA | NA | NA |
| BANTAS STREET | CANAL | 3 | NA | NA | NA | NA |
| BANTAS STREET | CANAL | 3 | NA | NA | NA | NA |
| BANTAS STREET | CANAL | 3 | NA | NA | NA | NA |
| BANTAS STREET | CANAL | 3 | NA | NA | NA | NA |
| BANTAS STREET | CANAL | 3 | NA | NA | NA | NA |
| BANTAS STREET | CANAL | 3 | NA | NA | NA | NA |
| BANTAS STREET | CANAL | 3 | NA | NA | NA | NA |
| BANTAS STREET | CANAL | 3 | NA | NA | NA | NA |
| BANTAS STREET | CANAL | 3 | NA | NA | NA | NA |
| BANTAS STREET | CANAL | 3 | NA | NA | NA | NA |
| BANTAS STREET | CANAL | 3 | NA | NA | NA | NA |
| BANTAS STREET | CANAL | 3 | NA | NA | NA | NA |
| BANTAS STREET | CANAL | 3 | NA | NA | NA | NA |
| BANTAS STREET | CANAL | 3 | NA | NA | NA | NA |
| BANTAS STREET | CANAL | 3 | NA | NA | NA | NA |
| BANTAS STREET | CANAL | 3 | NA | NA | NA | NA |
| BANTAS STREET | CANAL | 3 | NA | NA | NA | NA |
| BANTAS STREET | CANAL | 3 | NA | NA | NA | NA |
| BANTAS STREET | CANAL | 3 | NA | NA | NA | NA |
| BANTAS STREET | CANAL | 3 | NA | NA | NA | NA |
| BANTAS STREET | CANAL | 3 | NA | NA | NA | NA |
| BAXTER ST | BAXTER | 1 | 79 | 79 | NA | NA |
| BAXTER ST | BAXTER | 1 | 79 | 79 | NA | NA |
| BAXTER ST | BAXTER | 1 | 79 | 79 | NA | NA |
| BAXTER ST | BAXTER | 1 | 79 | 79 | NA | NA |
| BAXTER ST | BAXTER | 1 | 79 | 79 | NA | NA |
| BAXTER ST | BAXTER | 1 | 79 | 79 | NA | NA |
| BAXTER ST | BAXTER | 1 | 79 | 79 | NA | NA |
| BAXTER ST | BAXTER | 1 | 79 | 79 | NA | NA |
| BAXTER ST | BAXTER | 1 | 79 | 79 | NA | NA |
| BAXTER ST | BAXTER | 1 | 79 | 79 | NA | NA |
| BAXTER ST | BAXTER | 1 | 79 | 79 | NA | NA |
| BAXTER STREET | BAXTER | 1 | 79 | 79 | NA | NA |
| BAXTER STREET | BAXTER | 1 | 79 | 79 | NA | NA |
| BAXTER STREET | BAXTER | 1 | 79 | 79 | NA | NA |
| BAXTER STREET | BAXTER | 1 | 79 | 79 | NA | NA |
| BAXTER STREET | BAXTER | 1 | 79 | 79 | NA | NA |
| BAXTER STREET | BAXTER | 1 | 79 | 79 | NA | NA |
| BAXTER STREET | BAXTER | 1 | 79 | 79 | NA | NA |
| BAXTER STREET | BAXTER | 1 | 79 | 79 | NA | NA |
| BAXTER STREET | BAXTER | 1 | 79 | 79 | NA | NA |
| BAXTER STREET | BAXTER | 1 | 79 | 79 | NA | NA |
| BAXTER STREET | BAXTER | 1 | 79 | 79 | NA | NA |
| BAXTER STREET | BAXTER | 1 | 79 | 79 | NA | NA |
| BAXTER STREET | BAXTER | 1 | 79 | 79 | NA | NA |
| BAXTER STREET | BAXTER | 1 | 79 | 79 | NA | NA |
| BAXTER STREET | BAXTER | 1 | 79 | 79 | NA | NA |
| BAXTER STREET | BAXTER | 1 | 79 | 79 | NA | NA |
| BAXTER STREET | BAXTER | 1 | 79 | 79 | NA | NA |
| BAXTER STREET | BAXTER | 1 | 79 | 79 | NA | NA |
| BAXTER STREET | BAXTER | 1 | 79 | 79 | NA | NA |
| BAXTER STREET | BAXTER | 1 | 79 | 79 | NA | NA |
| BAXTER STREET | BAXTER | 1 | 79 | 79 | NA | NA |
| BAXTER STREET | BAXTER | 1 | 81 | 81 | NA | NA |
| BAXTER STREET | BAXTER | 1 | 81 | 81 | NA | NA |
| BAXTER STREET | BAXTER | 1 | 81 | 81 | NA | NA |
| BAXTER STREET | BAXTER | 1 | 81 | 81 | NA | NA |
| BAXTER STREET | BAXTER | 1 | 81 | 81 | NA | NA |
| BAXTER STREET | BAXTER | 1 | 81 | 81 | NA | NA |
| BAXTER STREET | BAXTER | 1 | 81 | 81 | NA | NA |
| BAXTER STREET | BAXTER | 1 | 81 | 81 | NA | NA |
| BAXTER STREET | BAXTER | 1 | 81 | 81 | NA | NA |
| BAXTER STREET | BAXTER | 1 | 81 | 81 | NA | NA |
| BAXTER STREET | BAXTER | 1 | 81 | 81 | NA | NA |
| BAXTER STREET | BAXTER | 1 | 81 | 81 | NA | NA |
| BAXTER STREET | BAXTER | 1 | 81 | 81 | NA | NA |
| BAXTER STREET | BAXTER | 1 | 81 | 81 | NA | NA |
| BAXTER STREET | BAXTER | 1 | 81 | 81 | NA | NA |
| BAXTER STREET | BAXTER | 1 | 81 | 81 | NA | NA |
| BAXTER STREET | BAXTER | 1 | 81 | 81 | NA | NA |
| BAXTER STREET | BAXTER | 1 | 81 | 81 | NA | NA |
| type | count | freq |
|---|---|---|
| 1 | 24 | 0.75 |
| 2 | 8 | 0.25 |
Perhaps not. Especially since some changes may not be confident ones (i.e. if the original match was already very confident).
When creating merged sequences, we can decide if sequences should have the same street name or if this rule can be relaxed. We know that if this rule is specified, more sequences are generated, which suggests that when street name is not taken into account, some sequences contain more than 1 street name. It could be useful to check if these sequences can identify errors in the street name cleaning process.
When determining if a street has been wrongly matched, we could use the following steps:
1. Look at sequences (generated with check_street off) with multiple street names in them
2. If any of the multiple street names are of a non-1/2 result type, use the predominant street name in the sequence instead
- this works because when filling down, we selected the most similar string from a pool of 3 above/below records. For nonsensical street names, this approach may not be ideal.
3. Determine if the different street names are close in spatial proximity, as the enumerator may have crossed an intersection. If they are close, leave it; else change to the predominant street name (archived)
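Steps 1 and 2 could look roughly like the Python sketch below. It reads step 2 as "replace only the rows whose result type is not 1 or 2", which is one interpretation of the notes. The function name and the dict-based row shape are assumptions, and the archived spatial check in step 3 is not included.

```python
from collections import Counter
from typing import Dict, List

GOOD_TYPES = (1, 2)  # result types treated as confident matches


def apply_predominant_street(rows: List[Dict]) -> List[Dict]:
    """For one sequence, overwrite poorly matched street names with the predominant one."""
    counts = Counter(r["best_match"] for r in rows)

    # Step 1: only sequences with multiple street names are of interest.
    if len(counts) < 2:
        return [dict(r) for r in rows]
    # Step 2 applies only if some row has a non-1/2 result type.
    if all(r["result_type"] in GOOD_TYPES for r in rows):
        return [dict(r) for r in rows]

    predominant, _ = counts.most_common(1)[0]
    return [
        dict(r) if r["result_type"] in GOOD_TYPES else dict(r, best_match=predominant)
        for r in rows
    ]
```

One caveat with this reading: the predominant name is counted over all rows, so a long run of poor matches can itself become the predominant name. Counting only type 1/2 rows would be a stricter alternative.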
| street_add | best_match | result_type | hn_1 |
|---|---|---|---|
| MOTT ELIZEBETH | MOTT | 4 | 72 |
| MOTT ST | MOTT | 1 | 72 |
| BAYARD | BAYARD | 1 | 66 |
| APEL | WALKER | 3 | 66 |
| BAYARD ST | BAYARD | 1 | 66 |
| BAYARD ST | BAYARD | 1 | 70 |
| MOTT | MOTT | 1 | 72 |
| street_add | best_match | result_type | hn_1 |
|---|---|---|---|
| EAST | 1 AVE | 3 | 301 |
| EAST | 1 AVE | 3 | 303 |
| EAST | 1 AVE | 3 | 307 |
| EAST | 1 AVE | 3 | 309 |
| EAST | 1 AVE | 3 | 311 |
| 99TH STREET EAST | E 99 | 3 | 311 |
| EAST 99TH STREET | E 99 | 2 | 311 |
| street_add | best_match | result_type | hn_1 |
|---|---|---|---|
| ST WEST | W 162 | 5 | 1052 |
| ST NICHOLAS AVE | NICHOLAS AVE | 1 | 1054 |
| ST NICHOLAS AVE | NICHOLAS AVE | 1 | 1056 |
| ST NICHOLAS AVE | NICHOLAS AVE | 1 | 1058 |
| ST NICHOLAS AVE | NICHOLAS AVE | 1 | 1064 |
| ST NICHOLAS AVE | NICHOLAS AVE | 1 | 1066 |
| ST NICHOLAS AVE | NICHOLAS AVE | 1 | 1072 |
| ST NICHOLAS AVE | NICHOLAS AVE | 1 | 1074 |
Current criteria for flagging a ‘faulty’ match: its result type is > 2; the rows above and below are good matches (result type 1-2); its house number does not differ from theirs by more than 4; and the rows above and below are the same street where both exist (a faulty row may be the start of a sequence, in which case only one neighbour exists).
## [1] 38
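One way to read the criteria above is the Python sketch below, applied to the rows of a single sequence. The function name `flag_faulty` and the dict-based row shape are assumptions, as is the reading that the house-number difference is measured between the flagged row and each neighbour.

```python
from typing import Dict, List


def flag_faulty(rows: List[Dict], max_diff: int = 4) -> List[int]:
    """Return the indices of rows that meet the 'faulty' match criteria."""
    flagged = []
    for i, row in enumerate(rows):
        if row["result_type"] <= 2:
            continue  # already a good match
        # Neighbours that exist (a row at the start or end has only one).
        neighbours = [rows[j] for j in (i - 1, i + 1) if 0 <= j < len(rows)]
        if not neighbours:
            continue
        if any(n["result_type"] > 2 for n in neighbours):
            continue  # neighbours must be good matches
        if any(abs(n["hn_1"] - row["hn_1"]) > max_diff for n in neighbours):
            continue  # house numbers must be close
        if len({n["best_match"] for n in neighbours}) > 1:
            continue  # neighbours must agree on the street
        flagged.append(i)
    return flagged
```

On the MOTT / BAYARD example earlier, this flags the `MOTT ELIZEBETH` row and the `APEL` row. The first is flagged even though its best_match already agrees with its neighbour, so a flag on its own does not imply a change.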
But!! Some errors still:
| street_add | best_match | result_type | hn_1 |
|---|---|---|---|
| FIRST AVENUE | 1 AVE | 1 | 400 |
| E 91 ST | E 91 | 1 | 404 |
| EAST 91 STREET | E 91 | 3 | 404 |
For situations where it’s possible that the multiple street names are correct:
If 30 Roosevelt is near 38 New Bowery, this could be correct and left alone
But this can identify errors. E.g. we know from manual checking that Pearl was somehow mistranscribed here. If Pearl and Madison are far apart, this process would be able to correct that:
Note: we should be quite strict about this process as we do not want our error rate to increase unnecessarily. More EDA needs to be done to check if this process is worth carrying out.